Multiply-accumulate (MAC) unit for single-instruction/multiple-data (SIMD) instructions

ABSTRACT

A tightly coupled dual 16-bit multiply-accumulate (MAC) unit for performing single-instruction/multiple-data (SIMD) operations may forward an intermediate result to another operation in a pipeline to resolve an accumulating dependency penalty. The MAC unit may also be used to perform 32-bit×32-bit operations.

BACKGROUND

[0001] Digital signal processors (DSPs) may operate as SIMD (Single-Instruction/Multiple-Data), or data parallel, processors. In SIMD operations, a single instruction is sent to a number of processing elements, which perform the same operation on different data. SIMD instructions provide for several types of standard operations including addition, subtraction, multiplication, multiply-accumulate (MAC), and a number of special instructions for performing, for example, clipping and bilinear interpolation operations.

[0002] Many DSP applications, including many speech codecs, require high performance 16-bit multiply-accumulate (MAC) operations. To achieve high performance for these 16-bit DSP applications, 64-bit SIMD instructions may be introduced. The 64-bit SIMD instructions may be used to handle media streams more efficiently and reduce register pressure and memory traffic since four 16-bit data items may be loaded into a 64-bit register at one time.

[0003] While high throughput is an important factor for achieving high performance, power consumption may also be an important consideration in designing DSPs for wireless/handheld products. Accordingly, MAC architectures which are capable of high performance with low power demands may be desirable for use in DSPs.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] FIG. 1 is a block diagram of a dual multiply-accumulate (MAC) unit according to an embodiment.

[0005] FIG. 2 is a block diagram illustrating a MAC SIMD (Single-Instruction/Multiple-Data) operation according to an embodiment.

[0006] FIGS. 3A to 3C are flowcharts describing a MAC SIMD operation according to an embodiment.

[0007] FIGS. 4A to 4C are block diagrams illustrating pipelined instruction sequences utilizing data forwarding according to an embodiment.

[0008] FIGS. 5A to 5C are block diagrams illustrating pipelined instruction sequences utilizing intermediate data forwarding according to an embodiment.

[0009] FIGS. 6A and 6B are flowcharts describing a 32-bit×32-bit MAC operation performed on a tightly coupled dual 16-bit MAC unit according to an embodiment.

[0010] FIG. 7 is a block diagram of a mobile video unit including a MAC unit according to an embodiment.

DETAILED DESCRIPTION

[0011] FIG. 1 illustrates a Multiply-Accumulate (MAC) unit 100 according to an embodiment. The MAC unit 100 may be used to perform a number of different SIMD (Single-Instruction/Multiple-Data) operations.

[0012] The MAC unit 100 may have a tightly coupled dual 16-bit MAC architecture. A 16-bit MAC SIMD operation 200 which may be performed by such a MAC unit is shown conceptually in FIG. 2. The contents of two 64-bit registers, 202 (wRn) and 204 (wRm), may be treated as four pairs of 16-bit values, A₀-A₃ (wRn) and B₀-B₃ (wRm). The first 16 bits to fourth 16 bits of wRn are multiplied by the first 16 bits to fourth 16 bits of wRm, respectively. The four multiplied results P₀-P₃ are then added to the value in 64-bit register 206 (wRd), and the result is written back to the register 206.
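The arithmetic of the operation 200 can be sketched as a small reference model. This is only an illustration: the function names are hypothetical, and the sketch assumes signed 16-bit lanes and 64-bit wraparound, neither of which is stated above.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed16(x):
    """Interpret the low 16 bits of x as a two's complement value (assumed signed lanes)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def simd_mac16(wRn, wRm, wRd):
    """Return wRd + A0*B0 + A1*B1 + A2*B2 + A3*B3, wrapped to 64 bits."""
    acc = wRd
    for lane in range(4):
        a = to_signed16(wRn >> (16 * lane))  # A_lane from wRn
        b = to_signed16(wRm >> (16 * lane))  # B_lane from wRm
        acc += a * b                         # P_lane added to the accumulator
    return acc & MASK64

# Lanes A = (1, 2, 3, 4), B = (5, 6, 7, 8), accumulator starts at 100.
result = simd_mac16(0x0004_0003_0002_0001, 0x0008_0007_0006_0005, 100)
```

The model says nothing about how the hardware stages the work; it only fixes the value that the four pipeline stages described next are expected to produce.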

[0013] The MAC operation 200 may be implemented in four execution stages: (1) Booth encoding and Wallace Tree compression of B₁ and B₀; (2) Booth encoding and Wallace Tree compression of B₃ and B₂; (3) 4-to-2 compression, and addition of the low 32 bits of the result; and (4) addition of the upper 32 bits of the result. These four stages may be referred to as the CSA0, CSA1, CLA0, and CLA1 stages, respectively.

[0014] FIGS. 3A to 3C illustrate a flow chart describing an implementation 300 of the MAC operation 200 according to an embodiment. In the CSA0 stage, a MUX & Booth encoder unit 102 selects B₀ (16 bits) and encodes those bits (block 302). Control signals are generated, each of which selects a partial product vector from the set {0, −A₀, −2A₀, A₀, 2A₀}. Nine partial product vectors, Pa0 to Pa8, are generated and passed to a MUX array 104 (block 304). All nine partial product vectors and the low 32 bits of the value in register 206 (wRd) are compressed into two vectors by a Wallace Tree unit 106 (block 306). The two vectors include a sum vector and a carry vector, which are stored in a sum vector flip-flop (FF) 108 and a carry vector FF 110, respectively.
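The selection of partial products from the set {0, −A, −2A, A, 2A} corresponds to radix-4 Booth recoding. The following sketch is illustrative only. It assumes that nine vectors arise because the 16-bit operand is extended to 18 bits (so that signed and unsigned operands are both covered); the text above gives the count but not this reason. The function names are hypothetical.

```python
def booth_radix4_digits(b, signed=True):
    """Recode a 16-bit operand into nine digits, each in {-2, -1, 0, 1, 2}."""
    b &= 0xFFFF
    ext = b
    if signed and (b & 0x8000):
        ext |= 0x3 << 16           # sign-extend to 18 bits (assumption)
    bits = ext << 1                # append the implicit b[-1] = 0
    digits = []
    for i in range(9):
        t = (bits >> (2 * i)) & 0b111   # overlapping triplet b[2i+1], b[2i], b[2i-1]
        digits.append(-2 * ((t >> 2) & 1) + ((t >> 1) & 1) + (t & 1))
    return digits

def partial_products(a, b, signed=True):
    """Each digit selects 0, +/-A or +/-2A, weighted by 4**i."""
    return [d * a * 4**i for i, d in enumerate(booth_radix4_digits(b, signed))]

a, b = 1234, -567
pps = partial_products(a, b & 0xFFFF)
total = sum(pps)   # equals a * b
```

In hardware the nine weighted vectors would be reduced by the Wallace Tree unit rather than summed directly; the plain sum here only checks that the recoding is value-preserving.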

[0015] A MUX & Booth encoder unit 112 selects B₁ (16 bits) and encodes those bits (block 308). Control signals are generated, each of which selects a partial product vector from the set {0, −A₁, −2A₁, A₁, 2A₁}. Nine partial product vectors, Pb0 to Pb8, are generated and passed to a MUX array 114 (block 310). All nine partial product vectors and a zero vector are compressed into two vectors by a Wallace Tree unit 116 (block 312). The two vectors include a sum vector and a carry vector, which are stored in a sum vector FF 118 and a carry vector FF 120, respectively.

[0016] In the CSA1 stage, the four vectors held in the sum and carry vector FFs 108, 110, 118, and 120 from the CSA0 stage are compressed into vectors Vs₀ and Vc₀ by a MUX & 4-to-2 compressor unit 122 (block 314). The MUX & Booth encoder unit 102 selects B₂ (16 bits) and encodes those bits (block 316). Control signals are generated, each of which selects a partial product vector from the set {0, −A₂, −2A₂, A₂, 2A₂}. Nine partial product vectors are generated (block 318). All nine partial product vectors and vector Vs₀ are then compressed into two vectors by the Wallace Tree unit 106 (block 320). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 108 and the carry vector FF 110, respectively.

[0017] The MUX & Booth encoder unit 112 selects B₃ (16 bits) and then encodes those bits (block 322). Control signals are generated, each of which selects a partial product vector from the set {0, −A₃, −2A₃, A₃, 2A₃}. Nine partial product vectors are generated (block 324). All nine partial product vectors and vector Vc₀ are then compressed into two vectors by the Wallace Tree unit 116 (block 326). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 118 and the carry vector FF 120, respectively.
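The compression of many vectors into one sum vector and one carry vector can be modeled with carry-save (3-to-2) steps. The sketch below is a functional illustration with hypothetical names. It applies the steps one after another, whereas a Wallace tree arranges them in parallel layers, so it reproduces the values but not the structure or timing of units 106 and 116.

```python
MASK64 = (1 << 64) - 1

def compress_3to2(x, y, z):
    """Carry-save step: x + y + z == s + c (mod 2**64), with no carry propagation."""
    s = x ^ y ^ z                               # bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1      # bitwise majority, shifted into place
    return s & MASK64, c & MASK64

def compress_to_two(vectors):
    """Reduce any number of vectors to a (sum, carry) pair."""
    pending = [v & MASK64 for v in vectors]
    while len(pending) < 2:
        pending.append(0)
    while len(pending) > 2:
        s, c = compress_3to2(pending[0], pending[1], pending[2])
        pending = pending[3:] + [s, c]
    return pending[0], pending[1]

# Ten inputs, mirroring nine partial products plus one accumulator or feedback vector.
inputs = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
vs, vc = compress_to_two(inputs)
```

Only the final carry-propagate addition of the pair, performed here in the CLA stages, yields an ordinary binary number.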

[0018] In the CLA0 stage, four vectors from FFs 108, 110, 118, and 120 from the CSA1 stage are sent to the 4-to-2 compressor unit 122 to generate vector Vs₁ and vector Vc₁ (block 327). The lower 32 bits of Vs₁ and Vc₁ are added by the carry look-ahead (CLA) unit 124 to generate the low 32 bits of the final result (block 328).

[0019] In the CLA1 stage, the upper bits of Vs₁ and Vc₁ are sign extended to two 32-bit vectors (block 330). The extended vectors and the upper 32 bits of wRd are then compressed into two vectors by a 3-to-2 compressor unit 126 (block 332). The two compressed vectors and the carry-in bit from the CLA0 unit 124 are added together by the CLA1 unit 128 to generate the upper 32 bits of the final result (block 334).
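Splitting the final 64-bit addition across the CLA0 and CLA1 stages can be illustrated as two 32-bit additions linked by a carry. The sketch is a simplification with hypothetical names: it adds only a sum vector and a carry vector, and omits the sign extension and the 3-to-2 compression with the upper 32 bits of wRd described above.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1

def add_low32(vs, vc):
    """First stage: add the low halves, return (low result, carry out)."""
    total = (vs & MASK32) + (vc & MASK32)
    return total & MASK32, total >> 32

def add_high32(vs, vc, carry_in):
    """Second stage: add the high halves plus the carry from the first stage."""
    return ((vs >> 32) + (vc >> 32) + carry_in) & MASK32

def two_stage_add64(vs, vc):
    low, carry = add_low32(vs, vc)
    high = add_high32(vs, vc, carry)
    return (high << 32) | low

vs = 0x00000001_FFFFFFFF
vc = 0x00000002_00000001
result = two_stage_add64(vs, vc)
```

Because the low half is complete one stage earlier than the high half, it is available for forwarding before the full 64-bit result exists.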

[0020] As described above, the Booth encoding and vector compression take two cycles to finish. In the first cycle, the results from both Wallace Tree units are sent back for further processing in the second cycle. Conventionally, all four vectors from FFs 108, 110, 118, and 120 would be sent back to the Wallace trees for further processing in the second cycle. However, it has been observed that the MUX & 4-to-2 compressor unit 122 may perform the 4-to-2 compression of the vectors faster than the MUX & Booth encoder units and the MUX arrays. Thus, only two vectors (Vs₀ and Vc₀) from the MUX & 4-to-2 compressor unit 122 are sent back to the Wallace Tree units 106 and 116. With this architecture, the feedback routings may be reduced and the Wallace Tree units 106, 116 made relatively smaller. Fewer feedback routings make the layout easier, which is desirable since routing limitations are an issue in MAC design.

[0021] Some conventional MAC implementations perform the 64-bit addition in one cycle. However, such MACs may not be suitable for a very high frequency 64-bit datapath, and their results may not have enough time to return through the bypass logic, which is commonly used for solving data dependency in pipelining. Compared with conventional architectures, the dual MAC architecture shown in FIG. 1 may be more readily implemented in very high frequency and low power applications. The CLA1 stage may have fewer logic gates than the CLA0 stage, which enables the final results to have enough time to return through the bypass logic, making this dual MAC architecture suitable for a high speed and low power 64-bit datapath.

[0022] The MAC unit may be used in a pipelined DSP. Pipelining, which changes the relative timing of instructions by overlapping their execution, may increase the throughput of a DSP compared to a non-pipelined DSP. However, pipelining may introduce data dependencies, or hazards, which may occur whenever the result of a previous instruction is not available and is needed by the current instruction. The current operation may be stalled in the pipeline until the data dependency is solved.

[0023] Typically, data forwarding is based on a final result of an operation. For many DSP algorithms, the result of the previous MAC operation needs to be added to the current MAC operation. However, a MAC operation may take four cycles to complete, and the result of the previous MAC operation may not be available for the current MAC operation. In this case, a data dependency called an accumulating dependency is introduced.

[0024] FIGS. 4A-4C show possible accumulating dependency penalties for a standard data forwarding scheme. The standard forwarding scheme is used to reduce the accumulating dependency penalty, where EX 402 is the execution stage for other non-MAC instructions. Even if standard data forwarding is employed, the accumulating dependency penalty is still two cycles in the worst case, which is shown in FIG. 4A (note that, although there are three stalls 404 before the final result is available after the CLA1 stage, the first stall 404 in FIG. 4A is due to a resource conflict in the Wallace Tree unit, which is not counted as a data dependency penalty). A two-cycle penalty may be too severe for some DSP applications, and hence it is desirable to eliminate the accumulating dependency penalty.

[0025] The MAC unit 100 may be used to implement a new data forwarding scheme, referred to as intermediate data forwarding, which may eliminate the accumulating dependency penalty. Instead of waiting for a final result from a previous operation, the intermediate data forwarding scheme forwards an intermediate result to solve data dependencies. FIGS. 5A-5C illustrate the sequences shown in FIGS. 4A-4C, but implemented using an intermediate data forwarding technique.
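Why an intermediate result is sufficient can be seen in a small functional model: an accumulator kept as an unresolved (sum, carry) pair can be folded into the next MAC without first completing the carry-propagate addition. The sketch below uses hypothetical names and makes simplifying assumptions: it forwards whole 64-bit vectors, and it does not model cycle timing or stalls.

```python
MASK64 = (1 << 64) - 1

def csa(x, y, z):
    """3-to-2 carry-save step: x + y + z == s + c (mod 2**64)."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s & MASK64, c & MASK64

def mac_step(products, vs, vc):
    """Fold new products into an accumulator held in carry-save form."""
    for p in products:
        vs, vc = csa(vs, vc, p & MASK64)
    return vs, vc

def resolve(vs, vc):
    """Final carry-propagate addition."""
    return (vs + vc) & MASK64

first = [3 * 4, 5 * 6]
second = [7 * 8, -2 * 9]

# Intermediate forwarding: the second MAC consumes the unresolved pair directly.
vs, vc = mac_step(first, 0, 0)
forwarded = resolve(*mac_step(second, vs, vc))

# Final-result forwarding: the first MAC is fully resolved before the second starts.
acc = resolve(*mac_step(first, 0, 0))
sequential = resolve(*mac_step(second, acc, 0))
```

Both paths give the same value; the difference in hardware is that the forwarded pair is available earlier in the pipeline than the resolved sum.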

[0026] As shown in FIGS. 5A-5C, the CSA0 stage 500 is segmented into two sub-stages 502 (BE0) and 504 (WT0) for Booth encoding and Wallace tree compressing, respectively, operands B₀ and B₁. The CSA1 stage 506 is segmented into two sub-stages 508 (BE1) and 510 (WT1) for Booth encoding and Wallace tree compressing, respectively, operands B₂ and B₃. The CLA0 stage 512 is segmented into two sub-stages 514 (4T2) and 516 (ADD0) for 4-to-2 compressing of vectors and low 32-bit addition of the final result. The CLA1 stage 518 includes the upper 32-bit addition of the final result 520 (ADD1).

[0027] In the cases shown in FIGS. 5A and 5B, the low 32 bits of intermediate vectors Vs, Vc of the first MAC instruction may be forwarded to the Wallace Tree units 106 and 116 for the second MAC instruction to solve the accumulating dependency. The upper 32-bit result of the first MAC instruction from the CLA1 unit 128 is forwarded to the MUX & 3-to-2 compressor unit 126. The stall 404 in FIG. 5A is due to the Wallace Tree resource conflict, which is not counted as a data dependency penalty.

[0028] In the case shown in FIG. 5C, the final result of the first MAC instruction is not available when it is needed by the second MAC instruction, but the low 32-bit result of the first MAC instruction is available. Instead of waiting for the final result, the low 32-bit result of the first MAC instruction is forwarded to the Wallace Tree unit 106 to solve the accumulating dependency. The upper 32-bit result of the first MAC instruction from the CLA1 unit 128 is forwarded to the MUX & 3-to-2 compressor unit 126.

[0029] The accumulating data dependency penalty comparisons between the standard data forwarding technique shown in FIGS. 4A to 4C and the intermediate data forwarding technique shown in FIGS. 5A to 5C are given in Table 1. As shown in Table 1, intermediate data forwarding may eliminate accumulating dependencies, which may enable relatively high throughput for many DSP applications.

TABLE 1

                               Penalty for   Penalty for   Penalty for
                               case (A)      case (B)      case (C)
Standard data forwarding       2 cycles      2 cycles      1 cycle
Intermediate data forwarding   0 cycles      0 cycles      0 cycles

[0030] A tightly coupled dual 16-bit MAC unit, such as that shown in FIG. 1, may be used for 32-bit×32-bit instructions as well as 16-bit SIMD instructions according to an embodiment. A 32-bit×32-bit operation may be divided into four 16-bit×16-bit operations, as shown in the following equation:

A[31:0]×B[31:0]=(A[31:16]×B[15:0]×2¹⁶+A[15:0]×B[15:0])+(A[31:16]×B[31:16]×2¹⁶ +A[15:0]×B[31:16])×2¹⁶.
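The decomposition can be checked numerically. The sketch below follows the grouping of the equation term by term; the function name is hypothetical and the operands are assumed to be unsigned 32-bit values, which the equation itself does not specify.

```python
def mul32_via_16(a, b):
    """Compute a 32x32-bit product from four 16x16-bit products (unsigned)."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF   # A[15:0], A[31:16]
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF   # B[15:0], B[31:16]
    # (A[31:16] x B[15:0] x 2^16 + A[15:0] x B[15:0])
    low_group = ((a_hi * b_lo) << 16) + a_lo * b_lo
    # (A[31:16] x B[31:16] x 2^16 + A[15:0] x B[31:16])
    high_group = ((a_hi * b_hi) << 16) + a_lo * b_hi
    return low_group + (high_group << 16)

a, b = 0xDEADBEEF, 0x12345678
product = mul32_via_16(a, b)
```

Each group pairs one product from each MAC datapath with a 16-bit shift between them, which matches the shift applied to the second datapath's vectors in the stages described below.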

[0031] FIGS. 6A and 6B illustrate a flow chart describing a 32-bit×32-bit MAC operation 600 according to an embodiment. In the CSA0 stage, the partial product vectors of A[15:0]×B[15:0] are generated by the MUX & Booth encoder unit 102 (block 602). The Wallace Tree unit 106 compresses the partial product vectors into two vectors (block 604). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 108 and the carry vector FF 110, respectively. The partial product vectors of A[31:16]×B[15:0] are generated by the MUX & Booth encoder unit 112 (block 606). The Wallace Tree unit 116 compresses the partial product vectors into two vectors (block 608). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 118 and the carry vector FF 120, respectively.

[0032] In the CSA1 stage, two vectors from the sum vector FF 118 and carry vector FF 120 are shifted left 16 bits (block 610). The MUX & 4-to-2 compressor unit 122 compresses the shifted vectors and the other two vectors from the sum vector FF 108 and carry vector FF 110 into vector Vs₀ and vector Vc₀ (block 612). The low 16 bits of Vs₀ and Vc₀ are sent to the CLA0 unit 124. The remaining bits are sent back to the Wallace Tree units 106 and 116. The final results from bit 0 to bit 15 are then generated by the CLA0 unit 124 (block 614). The partial product vectors of A[15:0]×B[31:16] and the feedback vector from Vs₀ are then compressed into two vectors by the Wallace Tree unit 106 (block 616). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 108 and the carry vector FF 110, respectively. The partial product vectors of A[31:16]×B[31:16] and the feedback vector from Vc₀ are then compressed into two vectors by the Wallace Tree unit 116 (block 618). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 118 and the carry vector FF 120, respectively.

[0033] In the CLA0 stage, two vectors from the sum vector FF 118 and the carry vector FF 120 are shifted left 16 bits (block 620). The MUX & 4-to-2 compressor unit 122 compresses the shifted vectors and the other two vectors from the sum vector FF 108 and the carry vector FF 110 into vector Vs₁ and vector Vc₁ (block 622). The low 16 bits of vectors Vs₁ and Vc₁ are added by the CLA0 unit 124. The final results from bit 16 to bit 31 are then generated (block 624).

[0034] In the CLA1 stage, the upper bits (from bit 16 to bit 47) of vectors Vs₁ and Vc₁ are added by the CLA1 unit 128 to generate the upper 32-bit final results (from bit 32 to bit 63) (block 626).
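The order in which the result bits become final (bits 0 to 15, then bits 16 to 31, then bits 32 to 63) can be traced in a functional sketch. The names are hypothetical, the operands are assumed unsigned, and ordinary integers stand in for the sum and carry vector pairs, so the sketch shows the data flow but not the carry-save representation or the accumulate input.

```python
def mul32_staged(a, b):
    """Produce a 32x32-bit product in the staged order described above (unsigned)."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF

    # First pass: A[15:0] x B[15:0] plus A[31:16] x B[15:0] shifted left 16 bits.
    s0 = a_lo * b_lo + ((a_hi * b_lo) << 16)
    bits_0_15 = s0 & 0xFFFF          # low 16 bits are already final
    feedback = s0 >> 16              # remaining bits are fed back

    # Second pass: feedback plus A[15:0] x B[31:16] plus A[31:16] x B[31:16] shifted.
    s1 = feedback + a_lo * b_hi + ((a_hi * b_hi) << 16)
    bits_16_31 = s1 & 0xFFFF         # low 16 bits give result bits 16 to 31
    bits_32_63 = s1 >> 16            # upper bits give result bits 32 to 63

    return bits_0_15 | (bits_16_31 << 16) | (bits_32_63 << 32)

staged = mul32_staged(0xDEADBEEF, 0x12345678)
```

The low 16 bits of the first pass can be finalized early because every later term is a multiple of 2¹⁶ and so cannot change them.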

[0035] The MAC unit 100 may be implemented in a variety of systems including general purpose computing systems, digital processing systems, laptop computers, personal digital assistants (PDAs) and cellular phones. In such a system, the MAC unit may be included in a processor coupled to a memory device, such as a Flash memory device or a static random access memory (SRAM), which stores an operating system or other software applications.

[0036] Such a processor may be used in video camcorders, teleconferencing, PC video cards, and High-Definition Television (HDTV). In addition, the processor may be used in connection with other technologies utilizing digital signal processing such as voice processing used in mobile telephony, speech recognition, and other applications.

[0037] For example, FIG. 7 illustrates a mobile video device 700 including a processor 701 including a MAC unit 100 according to an embodiment. The mobile video device 700 may be a hand-held device which displays video images produced from an encoded video signal received from an antenna 702 or a digital video storage medium 704, e.g., a digital video disc (DVD) or a memory card. The processor 701 may communicate with a cache memory 706, which may store instructions and data for the processor operations, and other devices, for example, an SRAM 708.

[0038] A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, blocks in the flowchart may be skipped or performed out of order and still produce desirable results. Furthermore, the size of the operands and number of operands operated on per SIMD instruction may vary. Accordingly, other embodiments are within the scope of the following claims.

1. A method comprising: performing a first compression operation in a first multiply-accumulate operation in a pipeline; generating two or more intermediate vectors in a first compression operation in the first multiply-accumulate operation; and forwarding at least a portion of each of the two or more intermediate vectors to a second multiply-accumulate operation in the pipeline.
2. The method of claim 1, wherein said forwarding at least a portion of each of the two or more intermediate vectors comprises forwarding a lower portion of each of the two or more intermediate vectors.
3. The method of claim 1, wherein said performing the first compression operation comprises compressing a first plurality of partial products into a first sum vector and a first carry vector and compressing a second plurality of partial products into a second sum vector and a second carry vector.
4. The method of claim 1, wherein said generating two or more intermediate vectors comprises compressing the first and second sum vectors and the first and second carry vectors into an intermediate sum vector and an intermediate carry vector.
5. The method of claim 1, wherein said forwarding comprises forwarding at least a portion of each of the two or more intermediate vectors to a Wallace tree compression unit.
6. An article comprising a machine-readable medium which stores machine-executable instructions, the instructions causing a machine to: perform a first compression operation in a first multiply-accumulate operation in a pipeline; generate two or more intermediate vectors in a first compression operation in the first multiply-accumulate operation; and forward at least a portion of each of the two or more intermediate vectors to a second multiply-accumulate operation in the pipeline.
7. The article of claim 6, wherein the instructions causing the machine to forward at least a portion of each of the two or more intermediate vectors include instructions causing the machine to forward a lower number of bits of each of the two or more intermediate vectors.
8. The article of claim 6, wherein the instructions causing the machine to perform the first compression operation include instructions causing the machine to compress a first plurality of partial products into a first sum vector and a first carry vector and compress a second plurality of partial products into a second sum vector and a second carry vector.
9. The article of claim 6, wherein the instructions causing the machine to generate two or more intermediate vectors include instructions causing the machine to compress the first and second sum vectors and the first and second carry vectors into an intermediate sum vector and an intermediate carry vector.
10. The article of claim 6, wherein the instructions causing the machine to forward include instructions causing the machine to forward at least a portion of each of the two or more intermediate vectors to a Wallace tree compression unit.
11. A method comprising: compressing a first plurality of partial products into a first sum vector and a first carry vector and compressing a second plurality of partial products into a second sum vector and a second carry vector in a first Wallace tree compression stage of a multiply-accumulate operation; compressing the first and second sum vectors and the first and second carry vectors into a first intermediate sum vector and a first intermediate carry vector; and compressing the intermediate sum vector and a third plurality of partial products and compressing the intermediate carry vector and a fourth plurality of partial products in a second stage of the multiply-accumulate operation.
12. The method of claim 11, wherein the multiply-accumulate operation comprises a single instruction/multiple data (SIMD) operation.
13. The method of claim 11, further comprising: generating the first plurality of partial products from a first pair of operands; generating the second plurality of partial products from a second pair of operands; generating the third plurality of partial products from a third pair of operands; and generating the fourth plurality of partial products from a fourth pair of operands.
14. The method of claim 11, further comprising forwarding the intermediate sum and carry vectors to a second multiply-accumulate operation in a pipeline.
15. The method of claim 14, wherein said forwarding comprises eliminating an accumulate data dependency in the second multiply-accumulate operation.
16. An article comprising a machine-readable medium which stores machine-executable instructions, the instructions causing a machine to: compress a first plurality of partial products into a first sum vector and a first carry vector and compress a second plurality of partial products into a second sum vector and a second carry vector in a first Wallace tree compression stage of a multiply-accumulate operation; compress the first and second sum vectors and the first and second carry vectors into a first intermediate sum vector and a first intermediate carry vector; and compress the intermediate sum vector and a third plurality of partial products and compress the intermediate carry vector and a fourth plurality of partial products in a second stage of the multiply-accumulate operation.
17. The article of claim 16, wherein the multiply-accumulate operation comprises a single instruction/multiple data (SIMD) operation.
18. The article of claim 16, further comprising instructions causing the machine to: generate the first plurality of partial products from a first pair of operands; generate the second plurality of partial products from a second pair of operands; generate the third plurality of partial products from a third pair of operands; and generate the fourth plurality of partial products from a fourth pair of operands.
19. The article of claim 16, further comprising instructions causing the machine to forward the intermediate sum and carry vectors to a second multiply-accumulate operation in a pipeline.
20. The article of claim 16, wherein the instructions causing the machine to forward include instructions causing the machine to eliminate an accumulate data dependency in the second multiply-accumulate operation.
21. An apparatus comprising: first and second Wallace tree compression units operative to compress vectors in first and second stages of a multiply-accumulate operation; a compressor operative to compress a plurality of vectors output from the first and second Wallace tree units in the first stage of the multiply-accumulate operation into two intermediate vectors; and a data path from an output of the compressor to an input of a multiplexer, said multiplexer operative to selectively input one of said intermediate vectors to one of said first and second Wallace tree compression units in the second stage of the multiply-accumulate operation.
22. The apparatus of claim 21, further comprising a dual multiply-accumulate unit.
23. The apparatus of claim 21, wherein the plurality of vectors comprise first and second sum vectors and first and second carry vectors.
24. The apparatus of claim 21, wherein the compressor comprises a four-to-two vector compressor.
25. The apparatus of claim 21, wherein the multiplexer comprises a first multiplexer having an output coupled to the first Wallace tree compression unit and a second multiplexer having an output coupled to the second Wallace tree compression unit.
26. A system comprising: a static random access memory; and a processor coupled to the static random access memory, said processor comprising a dual multiply-accumulate unit, said unit including first and second Wallace tree compression units operative to compress vectors in first and second stages of a multiply-accumulate operation, a compressor operative to compress a plurality of vectors output from the first and second Wallace tree units in the first stage of the multiply-accumulate operation into two intermediate vectors, and a data path from an output of the compressor to an input of a multiplexer, said multiplexer operative to selectively input one of said intermediate vectors to one of said first and second Wallace tree compression units in the second stage of the multiply-accumulate operation.
27. The system of claim 26, wherein the multiplexer comprises a first multiplexer having an output coupled to the first Wallace tree compression unit and a second multiplexer having an output coupled to the second Wallace tree compression unit.
28. A method comprising: performing a multiply-accumulate operation on first and second 2n-bit operands as four n-bit operations.
29. The method of claim 28, wherein said performing comprises: generating partial product vectors from the lower n bits of the first operand and the lower n bits of the second operand; generating partial product vectors from the upper n bits of the first operand and the lower n bits of the second operand; generating partial product vectors from the upper n bits of the first operand and the upper n bits of the second operand; and generating partial product vectors from the lower n bits of the first operand and the upper n bits of the second operand.
30. The method of claim 28, further comprising: compressing the partial products generated from the upper n bits of the first operand and the lower n bits of the second operand into two intermediate vectors; and shifting the intermediate vectors left by n bits.
31. The method of claim 28, wherein said performing comprises performing the multiply-accumulate operation on a tightly coupled dual n-bit multiply-accumulate unit.
32. The method of claim 28, wherein n equals sixteen.
33. An article comprising a machine-readable medium which stores machine-executable instructions, the instructions causing a machine to: perform a multiply-accumulate operation on first and second 2n-bit operands as four n-bit operations.
34. The article of claim 33, wherein the instructions causing the machine to perform include instructions causing the machine to: generate partial product vectors from the lower n bits of the first operand and the lower n bits of the second operand; generate partial product vectors from the upper n bits of the first operand and the lower n bits of the second operand; generate partial product vectors from the upper n bits of the first operand and the upper n bits of the second operand; and generate partial product vectors from the lower n bits of the first operand and the upper n bits of the second operand.
35. The article of claim 33, further comprising instructions causing the machine to: compress the partial products generated from the upper n bits of the first operand and the lower n bits of the second operand into two intermediate vectors; and shift the intermediate vectors left by n bits.
36. The article of claim 33, wherein the instructions causing the machine to perform include instructions causing the machine to perform the multiply-accumulate operation on a tightly coupled dual n-bit multiply-accumulate unit.
37. The article of claim 33, wherein n equals sixteen.